ABSTRACT

A random sample of size $N$ is divided into $k$ clusters that minimize the within-cluster sum of squares locally. Some large-sample properties of this k-means clustering method (as $k$ approaches $\infty$ with $N$) are obtained. In one dimension, it is established that the sample k-means clusters are such that the within-cluster sums of squares are asymptotically equal, and that the sizes of the cluster intervals are inversely proportional to the one-third power of the underlying density at the midpoints of the intervals. The difficulty involved in generalizing the results to the multivariate case is mentioned.
1. INTRODUCTION

Let the univariate observations $x_1, x_2, \ldots, x_N$ be sampled from a distribution $F$ with density function $f$. Suppose that these observations are partitioned into $k$ groups with means $\bar z_1, \ldots, \bar z_k$ such that no movement of an observation from one group to another will reduce the within-groups sum of squares

$$\mathrm{WSS}_N = \sum_{i=1}^{N} \min_{1 \le j \le k} |x_i - \bar z_j|^2 .$$

This method for division of a sample into $k$ groups to minimize the within-groups sum of squares locally is known in the clustering literature as k-means. In one dimension, the partition will be specified by $k-1$ cutpoints; the observations lying between common cutpoints are in the same group. See Hartigan (1975) for a detailed description of the k-means method, and see Hartigan and Wong (1979) for an efficient computational algorithm. This method has been widely used in various clustering applications (see Blashfield and Aldenderfer, 1978). Its asymptotic properties (when $N \to \infty$) for fixed $k$ have been studied by MacQueen (1967), Hartigan (1978), and Pollard (1979). Here, the sampling properties of k-means clusters when $k$ approaches $\infty$ with $N$ are presented.
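The partition described above can be illustrated with a minimal Python sketch. It uses Lloyd-type iterations (alternately recomputing the group means and placing the cutpoints midway between neighbouring means), which is an assumption made for brevity: a fixed point of this iteration satisfies the nearest-mean condition, but it is not guaranteed to satisfy the single-observation-move criterion used in the text, for which Hartigan and Wong (1979) is the reference. The names `kmeans_1d` and `wss`, the equal-count starting partition, and the uniform test data are all illustrative choices.

```python
import bisect
import random
from itertools import accumulate


def kmeans_1d(xs, k, max_iter=1000):
    """Lloyd-type 1-D k-means (a stand-in for the move-based criterion).

    Returns (sorted data, bounds, means); group j is data[bounds[j]:bounds[j+1]].
    """
    data = sorted(xs)
    n = len(data)
    prefix = [0.0] + list(accumulate(data))  # prefix[i] = sum of data[:i]

    def group_means(bounds):
        return [(prefix[hi] - prefix[lo]) / (hi - lo)
                for lo, hi in zip(bounds, bounds[1:])]

    bounds = [round(j * n / k) for j in range(k + 1)]  # equal-count start
    for _ in range(max_iter):
        means = group_means(bounds)
        # The k-1 cutpoints are midpoints between neighbouring group means.
        cuts = [(a + b) / 2 for a, b in zip(means, means[1:])]
        new_bounds = [0] + [bisect.bisect_right(data, c) for c in cuts] + [n]
        if any(hi <= lo for lo, hi in zip(new_bounds, new_bounds[1:])):
            raise ValueError("a group became empty; try a smaller k")
        if new_bounds == bounds:
            break
        bounds = new_bounds
    return data, bounds, group_means(bounds)


def wss(data, bounds, means):
    """Within-groups sum of squares of a partition of sorted data."""
    return sum((x - z) ** 2
               for lo, hi, z in zip(bounds, bounds[1:], means)
               for x in data[lo:hi])


random.seed(0)
sample = [random.random() for _ in range(2000)]
data, bounds, means = kmeans_1d(sample, k=5)
print(bounds, [round(z, 3) for z in means], round(wss(data, bounds, means), 4))
```

Working on the sorted data with prefix sums keeps each iteration at $O(k \log N)$, which reflects the one-dimensional structure noted in the text: a partition is fully described by its $k-1$ cutpoints.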

The properties of the univariate population k-means clusters when $k \to \infty$ are given in Wong (1982a). It is shown that for large $k$, the optimal population partition is such that the within-cluster sums of squares are equal, and that the sizes of the cluster intervals are inversely proportional to the one-third power of the underlying density at the midpoints of the intervals. In this paper, non-standard asymptotics are used to obtain the asymptotic properties (when $k \to \infty$ with $N$) of the locally optimal k-means clusters for samples from a general population $F$ on $[0,1]$; in particular, it is shown that the locally optimal partition approaches the population optimal partition under certain regularity conditions. The special difficulties in showing this result are: (1) the number of clusters $k$ approaches $\infty$ with $N$ (such that the length of each cluster interval approaches zero while the number of observations in each cluster approaches infinity), and (2) the locations and sizes of the cluster intervals are determined by an optimization procedure (so all results concerning the clusters need to hold uniformly for all clusters).
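The population-level properties quoted from Wong (1982a) can be checked numerically on a simple example. The sketch below is not taken from that paper: it assumes the density $f(x) = 0.5 + x$ on $[0,1]$, for which all integrals are available in closed form, and iterates the population analogue of the k-means fixed-point condition (each cutpoint is the midpoint of the two neighbouring conditional means). The function names and the choice $k = 20$ are illustrative.

```python
def density(x):
    return 0.5 + x  # assumed example density on [0, 1]; b = 0.5, B = 1.5


def mass(a, b):
    """F(I) for I = (a, b) under the example density."""
    return 0.5 * (b - a) + (b * b - a * a) / 2


def cond_mean(a, b):
    """Conditional mean of the example density on (a, b)."""
    first_moment = (b * b - a * a) / 4 + (b ** 3 - a ** 3) / 3
    return first_moment / mass(a, b)


def cell_wss(a, b):
    """Integral of (x - mu)^2 f(x) dx over (a, b), mu the conditional mean."""
    second_moment = (b ** 3 - a ** 3) / 6 + (b ** 4 - a ** 4) / 4
    mu = cond_mean(a, b)
    return second_moment - mass(a, b) * mu * mu


def population_kmeans(k, iters=5000):
    """Fixed-point iteration on the cutpoints, from an equal-width start."""
    cuts = [j / k for j in range(k + 1)]
    for _ in range(iters):
        means = [cond_mean(a, b) for a, b in zip(cuts, cuts[1:])]
        cuts = [0.0] + [(u + v) / 2 for u, v in zip(means, means[1:])] + [1.0]
    return cuts


k = 20
cuts = population_kmeans(k)
G = 0.75 * (1.5 ** (4 / 3) - 0.5 ** (4 / 3))  # integral of f^(1/3) over [0, 1]
scaled_lengths = [k * (b - a) * density((a + b) / 2) ** (1 / 3)
                  for a, b in zip(cuts, cuts[1:])]
scaled_wss = [12 * k ** 3 * cell_wss(a, b) for a, b in zip(cuts, cuts[1:])]
print(round(G, 4), round(min(scaled_lengths), 4), round(max(scaled_lengths), 4))
print(round(G ** 3, 4), round(min(scaled_wss), 4), round(max(scaled_wss), 4))
```

If the stated properties hold, every entry of `scaled_lengths` should be close to $\int_0^1 f(x)^{1/3}\,dx$ and every entry of `scaled_wss` close to its cube, so that interval lengths scale like $f^{-1/3}$ while the within-cluster sums of squares are nearly equal.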

Theorem 1 and Theorem 2 give, respectively, the asymptotic expressions for the lengths and the within-cluster sums of squares of the locally optimal k-means clusters. The result of Theorem 1 is obtained in Section 2, and Theorem 2 is proved in Section 3. Some concluding remarks are given in Section 4, in which the difficulties involved in generalizing these univariate results to many dimensions are also mentioned.




2.     ASYMPTOTIC  LENGTHS  OF  THE  LOCALLY  OPTIMAL 
SAMPLE   K-MEANS   CLUSTERS 


Let $x_1, x_2, \ldots, x_N$ be a random sample from a density $f$ which is positive and has four bounded derivatives in $[0,1]$. Denote the $i$th derivative of $f$ at $x$ by $f^{(i)}(x)$, and let $B = \sup_{0 \le x \le 1} f(x)$ and $b = \inf_{0 \le x \le 1} f(x)$. Suppose that the $N$ observations are grouped into $k_N$ clusters with means $\bar z_1 < \bar z_2 < \cdots < \bar z_{k_N}$ so that the within-cluster sum of squares of this locally optimal $k_N$-partition cannot be decreased by moving any single observation from its present cluster to any other cluster. Denote the $p$th order statistic by $x_{(p)}$, and let $n_j$ be the number of observations in the $j$th cluster. Then

$$x_{(\sum_{t=0}^{j-1} n_t + 1)} < \cdots < x_{(\sum_{t=0}^{j} n_t)}$$

are the observations in the $j$th cluster, where $n_0 = 0$. The length $e_j$ and the midpoint $m_j$ of the $j$th cluster are defined to be

$$e_j = x_{(\sum_{t=0}^{j} n_t + 1)} - x_{(\sum_{t=0}^{j-1} n_t)} \quad\text{and}\quad m_j = \tfrac{1}{2}\left[x_{(\sum_{t=0}^{j} n_t + 1)} + x_{(\sum_{t=0}^{j-1} n_t)}\right],$$

respectively, where $x_{(0)} = 0$ and $x_{(N+1)} = 1$.


There may be many locally optimal k-means partitions. Theorem 1 states that for any such partition, if $k_N = o([N/\log N]^{1/3})$, then

$$\max_{1 \le j \le k_N} \left| k_N\, e_j\, f(m_j)^{1/3} - \int_0^1 f(x)^{1/3}\,dx \right| = o_p(1).$$


To show the result of Theorem 1, we need a few lemmas. Lemma 1 is a direct consequence of a theorem of large deviations given in Feller (1971). It is useful in proving Lemma 2, which states that if the $n_j$'s are large enough, the cluster means are suitably close to the cluster midpoints. This result is then used in the proof of Theorem 1 to determine the relationship between the lengths of neighboring clusters.

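Lemma 1:

Let $x_1, x_2, \ldots, x_N$ be a random sample from a distribution $F$ with density $f$ which is positive and has four bounded derivatives in $[0,1]$. Put $B = \sup_{0 \le x \le 1} f(x)$ and $b = \inf_{0 \le x \le 1} f(x)$. Denote the $n$ observations contained in an open subinterval $I$ of $[0,1]$ by $z_1, z_2, \ldots, z_n$, and let $E[z_i - \mu_I] = 0$ and $E[(z_i - \mu_I)^2] = \sigma_I^2$.

Then there exist constants $C$, $D$, and $N_0$ not depending on $\mu_I$ and $\sigma_I$ (or $I$) such that if $N \ge N_0$ and $n \ge N(\log N/N)^{1/3}\, b/16$,

$$P\left\{\sigma_I^{-1} n^{-1/2}\, |z_1 + \cdots + z_n - n\mu_I| > C(\log N)^{1/2}\right\} \le D\, N^{-2} (\log N)^{-1/2}.$$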

Proof:

From the theorem of large deviations given in Feller (1971, p. 549), since $(4\log N)^{1/2}\, n^{-1/6} \to 0$ as $n \to \infty$, we have

$$P\left\{\sigma_I^{-1} n^{-1/2}\, |z_1 + \cdots + z_n - n\mu_I| > (4\log N)^{1/2}\right\} \Big/ \left[(2\pi)^{-1/2} (4\log N)^{-1/2} N^{-2}\right] \to 1$$

as $n \to \infty$.

Now since $f$ has bounded derivatives and $\sigma_I^{-1} n^{-1/2}\, |z_1 + \cdots + z_n - n\mu_I|$ has a distribution not depending on $\mu_I$ and $\sigma_I$ (or $I$), the lemma follows. (In the proof of Theorem 1, it will be shown that when $N$ is large enough and if $k_N = o([N/\log N]^{1/3})$, then each of the $k_N$ clusters contains at least $N(\log N/N)^{1/3}\, b/16$ observations, and hence the result of this lemma can be applied.)
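The rate $N^{-2}(\log N)^{-1/2}$ in Lemma 1 comes from the standard normal tail evaluated at $(4\log N)^{1/2}$. The following Python sketch, using only `math.erfc`, compares the exact one-sided normal tail $P(Z > x)$ with the approximation $(2\pi)^{-1/2} x^{-1} e^{-x^2/2}$ at $x = (4\log N)^{1/2}$, where $e^{-x^2/2} = N^{-2}$. It illustrates only the normal tail, not Feller's theorem itself; the two-sided probability carries an extra factor of 2, which affects only the constant. The function names are illustrative.

```python
import math


def upper_tail(x):
    """P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def tail_approx(N):
    """(2*pi)^(-1/2) * (4 log N)^(-1/2) * N^(-2)."""
    x = math.sqrt(4 * math.log(N))
    return (2 * math.pi) ** -0.5 / x * N ** -2.0


for N in (10 ** 2, 10 ** 4, 10 ** 6):
    x = math.sqrt(4 * math.log(N))
    print(N, upper_tail(x) / tail_approx(N))
```

By the classical bounds on the Mills ratio, the printed ratios lie between $1 - x^{-2}$ and 1, so they approach 1 as $N$ grows.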


In the application of Lemma 1, $n$ will be the number of observations in an open subinterval $I$ of $[0,1]$. Therefore, $n$ is approximately $NF(I)$, where $F(I) = \int_I dF$. More precisely, using Donsker's theorem for empirical processes (see Billingsley 1968, p. 141), we have

$$\sup_I |n_I - NF(I)| = O_p(N^{1/2}) \tag{2.1}$$

where $n_I$ is the number of observations in $I$, and the sup is taken over all open subintervals $I$ of $[0,1]$. Both Lemma 1 and Equation (2.1) will be used in the proof of Lemma 2 to give a uniform estimate of the deviations between cluster means and midpoints for those clusters that are large enough.
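The order of magnitude in (2.1) can be examined by simulation. The sketch below assumes Uniform(0,1) data, so that $F(I)$ is the length of $I$. In that case the supremum over intervals equals $N(D^+ + D^-)$, where $D^+$ and $D^-$ are the one-sided Kolmogorov-Smirnov deviations of the empirical distribution function (the Kuiper statistic), up to an error of at most one observation from open versus closed endpoints. The function name and the sample sizes are illustrative.

```python
import math
import random


def sup_interval_deviation(sample):
    """sup over intervals I of |n_I - N*F(I)| for Uniform(0, 1) data."""
    u = sorted(sample)
    n = len(u)
    d_plus = max((i + 1) / n - x for i, x in enumerate(u))   # sup(F_N - F)
    d_minus = max(x - i / n for i, x in enumerate(u))        # sup(F - F_N)
    return n * (d_plus + d_minus)


random.seed(1)
for N in (1000, 10000, 100000):
    dev = sup_interval_deviation([random.random() for _ in range(N)])
    print(N, round(dev / math.sqrt(N), 3))
```

If (2.1) holds, the printed ratios should remain of order one as $N$ increases rather than grow with $N$.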

Lemma 2:

Let $x_1, x_2, \ldots, x_N$ be a random sample from $F$ whose density $f$ is positive and has four bounded derivatives in $[0,1]$. Put $B = \sup_{0 \le x \le 1} f(x)$, $b = \inf_{0 \le x \le 1} f(x)$, and $F(I) = \int_I dF$. Let $\bar z_I$ be the mean of the observations in an open interval $I$, $\mu_I = \int_I x\,dF / F(I)$ be the conditional mean of $F$ on $I$, and $s_I$ be the size of the interval $I$. Then there exists a constant $C_0$ such that

$$P\left\{\sup_I\, s_I^{-1/2}\, |\bar z_I - \mu_I| \le C_0 (\log N/N)^{1/2}\right\} = 1 - o(1),$$

where the sup is taken over all open intervals $I$ (whose boundary points are order statistics) containing at least $N(\log N/N)^{1/3}\, b/16$ observations.

Proof:

For any $N \ge N_0$, consider an interval $I$ of the form $(x_{(p)}, x_{(p+n_I+1)})$, where $n_I \ge N(\log N/N)^{1/3}\, b/16$.

Using Lemma 1 (first conditioning on the two order statistics and then integrating out), we obtain

$$P\left\{\sigma_I^{-1} n_I^{1/2}\, |\bar z_I - \mu_I| > C(\log N)^{1/2}\right\} < D\, N^{-2} (\log N)^{-1/2} \tag{2.2}$$

where $\sigma_I^2 = \int_I (x - \mu_I)^2\,dF / F(I)$ is the conditional variance of $F$ on $I$.

Now, by the Taylor series expansion of $f$, $f(x) = f(m_I) + (x - m_I) f^{(1)}(m_I) + \tfrac{1}{2}(x - m_I)^2 f^{(2)}(q_x)$ for all $x$ in $I$, where $m_I$ is the midpoint of $I$ and $q_x$ is between $x$ and $m_I$. Therefore,

$$F(I) = f(m_I)\, s_I\, [1 + O(s_I^2)],$$

$$\mu_I = m_I + \frac{1}{12} \cdot \frac{f^{(1)}(m_I)}{f(m_I)} \cdot s_I^2 + O(s_I^4), \quad\text{and}$$

$$\sigma_I^2 = \frac{1}{12}\, s_I^2\, [1 + O(s_I^2)].$$

(Note that the constant in the $O$ term depends on the bound of the second derivative of $f$.)
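The three expansions above can be checked numerically on a density whose interval moments are available in closed form. The sketch below assumes $f(x) = 0.75(1 + x^2)$ on $[0,1]$, which is an illustrative choice and not taken from the paper. It reports the relative error of $F(I) \approx f(m_I) s_I$, the absolute error of $\mu_I \approx m_I + \frac{1}{12}\frac{f^{(1)}(m_I)}{f(m_I)} s_I^2$, and the relative error of $\sigma_I^2 \approx s_I^2/12$ as the interval length is halved.

```python
def f(x):
    return 0.75 * (1 + x * x)  # assumed example density on [0, 1]


def f1(x):
    return 1.5 * x  # first derivative of the example density


def moments(a, b):
    """Return F(I), mu_I and sigma_I^2 for I = (a, b), in closed form."""
    mass = 0.75 * ((b - a) + (b ** 3 - a ** 3) / 3)
    first = 0.75 * ((b ** 2 - a ** 2) / 2 + (b ** 4 - a ** 4) / 4)
    second = 0.75 * ((b ** 3 - a ** 3) / 3 + (b ** 5 - a ** 5) / 5)
    mu = first / mass
    return mass, mu, second / mass - mu * mu


def expansion_errors(m, s):
    """Errors of the three leading-order expansions on the interval m +- s/2."""
    mass, mu, var = moments(m - s / 2, m + s / 2)
    err_mass = mass / (f(m) * s) - 1                 # expected O(s^2)
    err_mu = mu - (m + f1(m) / f(m) * s * s / 12)    # expected O(s^4)
    err_var = var / (s * s / 12) - 1                 # expected O(s^2)
    return err_mass, err_mu, err_var


for s in (0.2, 0.1, 0.05):
    print(s, expansion_errors(0.3, s))
```

Halving $s$ should divide the first error by about 4 and the second by about 16, matching the stated $O(s_I^2)$ and $O(s_I^4)$ remainders.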

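It follows that (2.2) can be written as:

$$P\left\{\sqrt{12}\; s_I^{-1} [1 + O(s_I^2)]\; n_I^{1/2}\, |\bar z_I - \mu_I| \ge C(\log N)^{1/2}\right\} \le D\, N^{-2} (\log N)^{-1/2}.$$

Since the number of possible intervals $I$ is bounded by $N^2$, we have

$$P\left\{\sup_I\, s_I^{-1} [1 + O(s_I^2)]\; n_I^{1/2}\, |\bar z_I - \mu_I| < \frac{1}{\sqrt{12}}\, C(\log N)^{1/2}\right\} \ge 1 - D(\log N)^{-1/2} = 1 - o(1).$$

Now from (2.1), we have uniformly in $I$, $n_I \ge \tfrac{1}{2} N F(I) \ge \tfrac{1}{2} N b\, s_I$ with probability tending to one. Therefore,

$$P\left\{\sup_I\, s_I^{-1/2}\, |\bar z_I - \mu_I| \le C_0 (\log N/N)^{1/2}\right\} \to 1 \quad\text{as } N \to \infty,$$

where $C_0 = \frac{\sqrt{2}}{\sqrt{12}}\, b^{-1/2}\, C$, and the sup is taken over all intervals of the form $(x_{(p)}, x_{(p+n_I+1)})$ with $n_I \ge N(\log N/N)^{1/3}\, b/16$.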

The result of Lemma 2 shows that if the k-means clusters are large enough, the cluster means are suitably close to the cluster midpoints. When combined with some well-known properties of locally optimal k-means partitions, this result is useful in establishing the relationship between the lengths of neighboring clusters. The main difficulty to be overcome in the proof of Theorem 1 is showing that all the clusters are large enough.

Theorem 1:

Let $x_1, x_2, \ldots, x_N$ be a random sample from a distribution $F$ whose density $f$ is positive and has four bounded derivatives in $[0,1]$. Put $B = \sup_{0 \le x \le 1} f(x)$ and $b = \inf_{0 \le x \le 1} f(x)$, and let $e_j$ $(j = 1, 2, \ldots, k_N)$ be the length of the $j$th cluster of a locally optimal $k_N$-partition of the sample. Then, provided that $k_N = o([N/\log N]^{1/3})$, we have

$$\max_{1 \le j \le k_N} \left| k_N\, e_j\, f(m_j)^{1/3} - \int_0^1 f(x)^{1/3}\,dx \right| = o_p(1),$$

where $m_j$ is the midpoint of the $j$th cluster.

Proof:

Consider a locally optimal $k_N$-partition with $k_N = o([N/\log N]^{1/3})$. Denote the open interval (whose boundary points are order statistics) containing the $j$th cluster by $I_j$. Then, as before, we have

$$\mu_j = \int_{I_j} x\,dF \Big/ F(I_j) = m_j + \frac{1}{12} \cdot \frac{f^{(1)}(m_j)}{f(m_j)} \cdot e_j^2 + O(e_j^4). \tag{2.3}$$
The proof is in three parts. In part I, it is shown that if a cluster has length bounded above by $2(B/b)^{1/3}/k_N$ and below by $1/(2k_N)$, then both it and its neighboring clusters contain at least $N(\log N/N)^{1/3}\, b/16$ observations. In part II, using the result of Lemma 2, the relationship between the lengths of neighboring clusters is established; a bound of the ratio

$$\frac{e_{j-1}}{e_j} \cdot \left(\frac{f(m_{j-1})}{f(m_j)}\right)^{1/3}$$

is given by $1 + k_N^{-1}\, o_p(1)$. Since at least one of the $k_N$ clusters satisfies $2(B/b)^{1/3}/k_N \ge e_j \ge 1/(2k_N)$, applying parts I and II repeatedly gives the result of this theorem.

To avoid wordiness, statements are to be read as if they included the qualification: "with probability tending to one as $N$ approaches infinity".

[I] Suppose that $2(B/b)^{1/3}/k_N \ge e_j \ge 1/(2k_N)$. Then $F(I_j) \ge b/(2k_N)$. By (2.1), the $j$th cluster contains at least $Nb/(2k_N) - O_p(N^{1/2})$ observations. Therefore, the number of observations in the $j$th cluster exceeds $Nb/(4k_N)$, eventually. Since $k_N = o([N/\log N]^{1/3})$, this number exceeds $N(\log N/N)^{1/3}\, b/4$.




Applying Lemma 2 to the $j$th cluster, we have

$$|\bar z_j - \mu_j| < C_0 (\log N/N)^{1/2}\, e_j^{1/2}, \tag{2.4}$$

where $\bar z_j$ is the mean of the observations in the $j$th cluster. Consider the $(j-1)$st cluster, a cluster adjacent to the $j$th cluster. The largest observation in the $(j-1)$st cluster is $x_{(\sum_{t=0}^{j-1} n_t)}$ and the smallest observation in the $j$th cluster is $x_{(\sum_{t=0}^{j-1} n_t + 1)}$. Then by local optimality, the midpoint $M$ between $\bar z_{j-1}$ and $\bar z_j$ must lie between $x_{(\sum_{t=0}^{j-1} n_t)}$ and $x_{(\sum_{t=0}^{j-1} n_t + 1)}$. Hence

$$e_{j-1} \ge M - \bar z_{j-1} = \bar z_j - M = \bar z_j - x_{(\sum_{t=0}^{j-1} n_t)} + O_p([\log N/N]),$$

since the largest gap between successive order statistics is $O_p([\log N/N])$.

It follows from (2.3) that

$$e_{j-1} \ge \bar z_j - \mu_j + \tfrac{1}{2} e_j + O(e_j^2) + O_p([\log N/N]).$$

And from (2.4), we obtain

$$e_{j-1} \ge \tfrac{1}{2} e_j - C_0 [\log N/N]^{1/2}\, e_j^{1/2} + O(e_j^2) + O_p([\log N/N]) \ge \tfrac{1}{4} e_j \ge 1/(8k_N),$$

since $e_j \ge 1/(2k_N)$.

Therefore, it follows from (2.1) that the $(j-1)$st interval contains at least $Nb/(8k_N) - O_p(N^{1/2}) \ge N[\log N/N]^{1/3}\, b/16$ observations eventually.


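[II] Now, applying Lemma 2 to the $(j-1)$st cluster, we have

$$|\bar z_{j-1} - \mu_{j-1}| < C_0 (\log N/N)^{1/2}\, e_{j-1}^{1/2}. \tag{2.5}$$

Since $M - \bar z_{j-1} = \bar z_j - M$, we have

$$\left[m_{j-1} + \tfrac{1}{2} e_{j-1} + O_p([\log N/N])\right] - \bar z_{j-1} = \bar z_j - \left[m_j - \tfrac{1}{2} e_j + O_p([\log N/N])\right]. \tag{2.6}$$

From (2.3), (2.6) can be written as

$$\left[\mu_{j-1} - \frac{1}{12} \cdot \frac{f^{(1)}(m_{j-1})}{f(m_{j-1})} \cdot e_{j-1}^2 - O(e_{j-1}^4) + \tfrac{1}{2} e_{j-1}\right] - \bar z_{j-1} = \bar z_j - \left[\mu_j - \frac{1}{12} \cdot \frac{f^{(1)}(m_j)}{f(m_j)} \cdot e_j^2 - O(e_j^4) - \tfrac{1}{2} e_j\right] + 2\,O_p([\log N/N]). \tag{2.7}$$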


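Let $f^*$ be the density at the midpoint between $m_j$ and $m_{j-1}$. Then

$$e_j \left[1 + \frac{1}{6} \cdot \frac{f^{(1)}(m_j)}{f(m_j)} \cdot e_j + O(e_j^3)\right] = e_j\, [f^*/f(m_j)]^{-1/3}\, [1 + O(e_j^2) + O_p(\log N/N)],$$

since $f^* = f(m_j) - \tfrac{1}{2} f^{(1)}(m_j)\, e_j + O(e_j^2) + O_p([\log N/N])$ by the expansion of $f$ about $m_j$. Similarly, by the expansion of $f$ about $m_{j-1}$, we have

$$e_{j-1} \left[1 - \frac{1}{6} \cdot \frac{f^{(1)}(m_{j-1})}{f(m_{j-1})} \cdot e_{j-1} - O(e_{j-1}^3)\right] = e_{j-1}\, [f^*/f(m_{j-1})]^{-1/3}\, [1 + O(e_{j-1}^2) + O_p(\log N/N)].$$

Therefore, (2.7) becomes

$$(\mu_{j-1} - \bar z_{j-1}) + \tfrac{1}{2} e_{j-1}\, [f^*/f(m_{j-1})]^{-1/3}\, [1 + O(e_{j-1}^2)] = (\bar z_j - \mu_j) + \tfrac{1}{2} e_j\, [f^*/f(m_j)]^{-1/3}\, [1 + O(e_j^2)] + 2\,O_p(\log N/N). \tag{2.8}$$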




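Combining (2.4), (2.5), and (2.8), we can first show the ratio $e_{j-1}/e_j$ is bounded, and then

$$\left|\frac{e_{j-1}}{e_j} \cdot \left[\frac{f(m_{j-1})}{f(m_j)}\right]^{1/3} - 1\right| \le 4 C_0\, [f^*/f(m_j)]^{1/3} (\log N/N)^{1/2}\, e_j^{-1/2} + 2 e_j^{-1}\, O_p(\log N/N) + O(e_j^2) \tag{2.9}$$

$$\le k_N^{-1} \left[4\sqrt{2}\; C_0\, (B/b)^{1/3}\, k_N^{3/2} (\log N/N)^{1/2} + 4 k_N^2\, O_p(\log N/N) + O(k_N^{-1})\right]$$

$$= k_N^{-1}\, o_p(1),$$

and this bound does not depend on the intervals involved.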


[III] From the first inequality in (2.9), we can show by contradiction that at least one of the $k_N$ cluster intervals satisfies $2(B/b)^{1/3}/k_N \ge e_j \ge 1/(2k_N)$. Then using the bound in (2.9) and carrying on at most $k_N$ comparisons of adjacent clusters, we obtain

$$\frac{e_i}{e_j} \cdot \left[\frac{f(m_i)}{f(m_j)}\right]^{1/3} = \left[1 + k_N^{-1}\, o_p(1)\right]^{k_N} = 1 + o_p(1)$$

uniformly in $1 \le i, j \le k_N$.

Since $\sum_{j=1}^{k_N} e_j f(m_j)^{1/3} \to \int_0^1 f(x)^{1/3}\,dx$ as $N \to \infty$, the theorem follows.
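The last two steps of the proof are stated briefly; a worked version may help. The following is a sketch of one way to fill in the details, where the symbols $\delta_l$ and $\delta_N$ are introduced here for illustration and do not appear in the original argument.

```latex
% Chaining step: for i < j, telescope over adjacent clusters. Each factor
% is controlled by (2.9), with a bound \delta_N that is uniform in l.
\[
\frac{e_i}{e_j}\left[\frac{f(m_i)}{f(m_j)}\right]^{1/3}
  = \prod_{l=i+1}^{j} \frac{e_{l-1}}{e_l}
      \left[\frac{f(m_{l-1})}{f(m_l)}\right]^{1/3}
  = \prod_{l=i+1}^{j} \left(1 + \frac{\delta_l}{k_N}\right),
  \qquad \max_l |\delta_l| \le \delta_N = o_p(1).
\]
% There are at most k_N factors, so for 0 \le \delta_N \le k_N:
\[
1 - \delta_N \le \left(1 - \frac{\delta_N}{k_N}\right)^{k_N}
  \le \prod_{l=i+1}^{j} \left(1 + \frac{\delta_l}{k_N}\right)
  \le \left(1 + \frac{\delta_N}{k_N}\right)^{k_N} \le e^{\delta_N},
\]
% and both outer bounds tend to 1 in probability.
% Summation step: fix j and sum the uniform relation over i.
\[
\sum_{i=1}^{k_N} e_i f(m_i)^{1/3}
  = k_N\, e_j\, f(m_j)^{1/3}\, [1 + o_p(1)]
  \quad\text{uniformly in } j,
\]
% so k_N e_j f(m_j)^{1/3} converges to \int_0^1 f(x)^{1/3} dx uniformly in j.
```

The uniformity of the bound in (2.9) is what allows a single $\delta_N$ to control all $k_N$ factors at once.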
3. ASYMPTOTIC WITHIN-CLUSTER SUMS OF SQUARES OF THE
LOCALLY OPTIMAL SAMPLE K-MEANS CLUSTERS


As in Section 2, let $x_{(\sum_{t=0}^{j-1} n_t + 1)} < \cdots < x_{(\sum_{t=0}^{j} n_t)}$ denote the $n_j$ observations in the $j$th cluster of a locally optimal k-means partition of a sample from $f$, and let $\bar z_j$ be the $j$th cluster mean. The within-cluster sum of squares of the $j$th cluster, $\mathrm{WSS}_j$, is defined to be

$$\mathrm{WSS}_j = \sum_{i=1}^{n_j} \left(x_{(\sum_{t=0}^{j-1} n_t + i)} - \bar z_j\right)^2 .$$

In this section, we will show that the within-cluster sums of squares of the $k_N$ clusters are asymptotically equal. First, a direct consequence of Feller's theorem on large deviations (Lemma 3) is used in the proof of Lemma 4 to obtain a uniform estimate of the within-cluster sum of squares, which is a function of the length of the cluster interval and the density at the midpoint of the interval. The result of Theorem 2 then follows from Theorem 1 and Lemma 4.

Lemma 3:

Let $x_1, x_2, \ldots, x_N$ be a random sample from a distribution $F$ with density $f$ which is positive and has four bounded derivatives in $[0,1]$. Put $B = \sup_{0 \le x \le 1} f(x)$ and $b = \inf_{0 \le x \le 1} f(x)$. Denote the $n$ observations contained in an open interval $I$ by $z_1, z_2, \ldots, z_n$, and let $E[z_i - \mu_I] = 0$, $E[(z_i - \mu_I)^2] = \sigma_I^2$, and $E[(z_i - \mu_I)^4] = \gamma_I < \infty$. Then there exist constants $C^*$, $D^*$, and $N^*$, not depending on $I$, such that if $N \ge N^*$ and $n \ge N(\log N/N)^{1/3}\, b/16$,

$$P\left\{\gamma_I^{-1/2}\, n^{-1/2} \left|\sum_{i=1}^{n} (z_i - \bar z)^2 - n\sigma_I^2\right| > C^*(\log N)^{1/2}\right\} \le D^*\, N^{-2} (\log N)^{-1/2}.$$

(Since the proof is similar to that of Lemma 1, it will not be given here.)
Lemma 4:

Let $x_1, x_2, \ldots, x_N$ be a random sample from $F$ whose density $f$ is positive and has four bounded derivatives in $[0,1]$. Put $B = \sup_{0 \le x \le 1} f(x)$, $b = \inf_{0 \le x \le 1} f(x)$, and $F(I) = \int_I dF$. Let $\bar z_I$ be the mean of the observations in an open interval $I$, $\mathrm{WSS}_I$ be the within-cluster sum of squares of the observations in $I$, $m_I$ be the midpoint of $I$, and $s_I$ be the size of the interval $I$. Then there exists a constant $C^*$ such that

$$P\left\{\sup_I\, s_I^{-5/2} \left|\mathrm{WSS}_I - \frac{1}{12} N f(m_I)\, s_I^3\, [1 + O(s_I^2)]\right| \le C^* N (\log N/N)^{1/2}\right\} = 1 - o(1),$$

where the sup is taken over all open intervals $I$ (whose boundary points are order statistics) containing at least $N(\log N/N)^{1/3}\, b/16$ observations.

The proof of this lemma is similar to that of Lemma 2; therefore, only the differences between the two proofs are outlined here. In this proof, the Taylor series expansion is carried out to the fourth-order terms. After some series manipulations, we obtain

$$\sigma_I^2 = \int_I (x - \mu_I)^2\,dF \Big/ F(I) = \frac{1}{12}\, s_I^2\, [1 + O(s_I^2)], \tag{3.1}$$

and

$$\gamma_I = \int_I (x - \mu_I)^4\,dF \Big/ F(I) = \frac{1}{80}\, s_I^4\, [1 + O(s_I^2)]. \tag{3.2}$$

Applying (3.1) and (3.2) to the result of Lemma 3 will give this lemma, because the number of possible intervals $I$ is again bounded by $N^2$.
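The leading constants $1/12$ and $1/80$ in (3.1) and (3.2) are the second and fourth central moments of a uniform distribution on an interval of length $s$, which is what the conditional distribution on a short interval approaches. The short computation below records this, together with a heuristic reading (not part of the original proof) of why $s_I^{-5/2}$ is the natural normalization in Lemma 4.

```latex
% Central moments of the uniform distribution on an interval of length s.
\[
\frac{1}{s}\int_{-s/2}^{s/2} u^2\,du = \frac{s^2}{12},
\qquad
\frac{1}{s}\int_{-s/2}^{s/2} u^4\,du = \frac{s^4}{80}.
\]
% Centering and scale in Lemma 3 when n is close to N F(I) = N f(m_I) s_I:
\[
n\,\sigma_I^2 \approx N f(m_I)\, s_I \cdot \frac{s_I^2}{12}
  = \frac{1}{12}\, N f(m_I)\, s_I^3,
\qquad
\gamma_I^{1/2}\, n^{1/2} \approx \frac{s_I^2}{\sqrt{80}}\,
  \bigl[N f(m_I)\, s_I\bigr]^{1/2} = O\!\left(N^{1/2} s_I^{5/2}\right).
\]
```

Multiplying the scale $N^{1/2} s_I^{5/2}$ by the deviation level $(\log N)^{1/2}$ from Lemma 3 gives $s_I^{5/2}\, N (\log N/N)^{1/2}$, which is the bound appearing in Lemma 4.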
Theorem 2:

Let $x_1, x_2, \ldots, x_N$ be a random sample from a distribution $F$ whose density $f$ is positive and has four bounded derivatives in $[0,1]$. Put $B = \sup_{0 \le x \le 1} f(x)$ and $b = \inf_{0 \le x \le 1} f(x)$. Let $\mathrm{WSS}_j$ $(j = 1, 2, \ldots, k_N)$ be the within-cluster sum of squares of the $j$th cluster of a locally optimal $k_N$-partition of the sample. Then, provided that $k_N = o([N/\log N]^{1/3})$, we have

$$\max_{1 \le j \le k_N} \left| 12 N^{-1} k_N^3\, \mathrm{WSS}_j - \left[\int_0^1 f(x)^{1/3}\,dx\right]^3 \right| = o_p(1).$$

Proof:

Consider a locally optimal $k_N$-partition with $k_N = o([N/\log N]^{1/3})$. It is shown in Theorem 1 that for all $N$ large enough, with probability tending to one, (i) the number of observations in each cluster exceeds $N(\log N/N)^{1/3}\, b/16$, and (ii) $e_j = k_N^{-1} f(m_j)^{-1/3}\, G\, [1 + o_p(1)]$, where $G = \int_0^1 f(x)^{1/3}\,dx$. From (i), we can apply Lemma 4 to obtain

$$e_j^{-5/2} \left|\mathrm{WSS}_j - \frac{1}{12} N f(m_j)\, e_j^3\, [1 + O(e_j^2)]\right| \le C^* N (\log N/N)^{1/2},$$

uniformly in $1 \le j \le k_N$. It follows from (ii) that we have, uniformly in $1 \le j \le k_N$,

$$\left|\mathrm{WSS}_j - \frac{1}{12} N k_N^{-3}\, G^3\, [1 + o_p(1)]\right| \le 2 C^* N (\log N/N)^{1/2}\, k_N^{-5/2}\, b^{-5/6}\, G^{5/2}.$$

Therefore,

$$\left|12 N^{-1} k_N^3\, \mathrm{WSS}_j - G^3\right| \le o_p(1) + C^{**}\, k_N^{1/2} (\log N/N)^{1/2} = o_p(1),$$

where $C^{**} = 24\, C^*\, b^{-5/6}\, G^{5/2}$. And the theorem is proved.
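The equalization of the within-cluster sums of squares can be examined on simulated data. The sketch below makes several assumptions that are not in the paper: the density $f(x) = 0.5 + x$ on $[0,1]$ (sampled by inverting its distribution function), the values $N = 20000$ and $k_N = 8$, and Lloyd-type iterations as a stand-in for a locally optimal partition. The helper name `lloyd_bounds` is illustrative.

```python
import bisect
import math
import random
from itertools import accumulate


def lloyd_bounds(data, k, max_iter=2000):
    """Index bounds of a Lloyd-type fixed point for sorted 1-D data."""
    n = len(data)
    prefix = [0.0] + list(accumulate(data))  # prefix[i] = sum of data[:i]
    bounds = [round(j * n / k) for j in range(k + 1)]  # equal-count start
    for _ in range(max_iter):
        means = [(prefix[hi] - prefix[lo]) / (hi - lo)
                 for lo, hi in zip(bounds, bounds[1:])]
        cuts = [(a + b) / 2 for a, b in zip(means, means[1:])]
        new_bounds = [0] + [bisect.bisect_right(data, c) for c in cuts] + [n]
        if new_bounds == bounds:
            break
        bounds = new_bounds
    return bounds


random.seed(2)
N, k = 20000, 8
# Inverse-CDF sampling: F(x) = x/2 + x^2/2 = u gives x = (sqrt(1 + 8u) - 1)/2.
data = sorted((math.sqrt(1 + 8 * random.random()) - 1) / 2 for _ in range(N))
bounds = lloyd_bounds(data, k)

G = 0.75 * (1.5 ** (4 / 3) - 0.5 ** (4 / 3))  # integral of f^(1/3) over [0, 1]
scaled = []
for lo, hi in zip(bounds, bounds[1:]):
    cluster = data[lo:hi]
    zbar = sum(cluster) / len(cluster)
    scaled.append(12 * k ** 3 * sum((x - zbar) ** 2 for x in cluster) / N)
print(round(G ** 3, 3), [round(v, 3) for v in scaled])
```

Under Theorem 2 each entry of `scaled`, which is $12 N^{-1} k_N^3\, \mathrm{WSS}_j$, should be close to $G^3$, even though the clusters differ in both length and number of observations.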
Since the globally optimal k-means partition of the sample is necessarily locally optimal, the results of Theorem 1 and Theorem 2 also apply to the globally optimal $k_N$-partition. Moreover, the generalization of these results to densities with finite support $[a,b]$ is immediate.
4. DISCUSSION

In this paper, the properties of univariate sample k-means clusters when $k$ approaches infinity with the sample size $N$ are presented. This study is of interest in its own right because we want to examine the sampling properties of this widely-used clustering method. The result in Section 2 also indicates that the k-means method would partition a sample from a distribution with density $f$ on $[a,b]$ in such a way that the sizes of the cluster intervals are adaptive to $f$; the intervals are large where the density is low and small where the density is high. Therefore, k-means can potentially be used as a method for constructing variable-cell histograms. Indeed, a histogram estimate of $f$ based on the k-means method is proposed in Wong (1982b), and it is shown to be uniformly consistent in probability.
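A variable-cell histogram of the kind mentioned above can be sketched in a few lines of Python. The estimator of Wong (1982b) is not reproduced here; the code below only computes the generic histogram heights $n_j / (N \cdot \text{width}_j)$ for a given set of cell edges, which are assumed to come from some k-means partition. The function name and the small data set are illustrative.

```python
import bisect


def variable_cell_histogram(sample, edges):
    """Heights n_j / (N * width_j) for cells [edges[j], edges[j+1]).

    The last cell is closed on the right so that the endpoint is counted.
    """
    data = sorted(sample)
    n = len(data)
    pos = [bisect.bisect_left(data, e) for e in edges[:-1]]
    pos.append(bisect.bisect_right(data, edges[-1]))
    return [(hi_i - lo_i) / (n * (hi - lo))
            for lo_i, hi_i, lo, hi in zip(pos, pos[1:], edges, edges[1:])]


sample = [0.05, 0.1, 0.15, 0.2, 0.6, 0.9]
edges = [0.0, 0.25, 1.0]  # a narrow cell where the data are dense
heights = variable_cell_histogram(sample, edges)
print(heights)
```

With narrow cells where observations are dense and wide cells where they are sparse, the cell widths adapt to the density, which is the property that Section 2 attributes to the k-means intervals.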

The multivariate case requires further investigation, as the generalization of univariate results to many dimensions is not straightforward. An important first step is to determine the configuration of the optimal population k-means partition in several dimensions. There is some indication that the best partition in $R^2$ is into regular hexagons (Wong 1982c), but it is not clear that the best partition is into regular polytopes for higher dimensions.